Data visualtisation Acquiring data is only half the battle, you must communicate that data in an effective manner. Aesthetics matter, to a point, the goal of effective data visualization is to get out of the way and let the data speak for itself, without misrepresenting the data. There are some great resources available out there on data visualisation based upon Edward Tufte’s principles:
R Notebooks
In this session we will make use or R Notebooks.
- From the
Filemenu, chooseNew fileand thenR Notebook. A new window will open containing an example notebook. saveyour notebook as my_notebook.rmd in the notebook folder using thesave asfunction in theFilemenu of Rstudio.
Notebook basics
Notebook includes “chunks” of text, and chunks of R code.
- unlike console/rscript text is treated as text not code
- code must be placed in rchunks
- To place an R chunk, click
insertand thenR. Or alternatively CNTRL+ALT+i
R Chunk
- code is placed in the chunk e.g.
1+1 - Too run a chunk, click the green arrow in the chunk, the output appears below, and stays there even when you run another chunk
Run R Chunk
- Try it yourself, crearte a chunk and copy the following code into the chunk, between the formatting marks.
1+1
[1] 2
Setup
We need to install additional packages to run this course. You already have tidyverse but just in case you don’t put the following in the console and press enter:
install.packages('tidyverse')
The other packages we need are:
install.packages("cowplot")
install.packages("ggbeeswarm")
install.packages("readxl")
install.packages("ggpubr")
This might take a couple of minutes, if prompted with the phrase ‘compile from source’ select no. Although yes will also work it will just take longer to install.
Now load the packages required for this course:
library(tidyverse)
library(readxl)
library(cowplot)
library(ggbeeswarm)
#library(gtable)
#library(grid)
#library(gridExtra)
library(ggpubr)
Data
For today’s dataset we will data that was generated during my postdoc @ University of Manchester. I investigatied the immune response in the pleural cavity to a falarial nemadote, Litmosoides sigmodontis (lito for short). This data is a metaanlsysis of circa 600s mice compiled from many individual experiments. Our research question involved comparing how 2 strains of mice BALB/c and C57BL/6 responded to infection. Obviously it was a bit more complicated than that but for the purpose of this analysis all you need to know is that a BALB/c mouse is susceptible to infection and a C57BL/6 mouse is resistant. In this course we will visualize immune cell dynamics in the pleural space of naive and infected mice from both strains.
Lito infection is strain-dependent
Load data
Data is saved as an excel file. This is a fine example of dirty messy data created during research.
messy
This contains data from multiple experiments (causing batch effects), it has missing data for some variables. It has variables whose name does not convey much meaning. This includes the variables ‘OK for strain comparison’ and ‘remove from analysis (justified)’. These are QC columns that would not appear in publication quality datasets but often appear in raw datasets. In this respect we are adhering to a core principle of data science never touch the raw data. In truth, this is not really raw data, it is a compilation of raw data (cell counts, worm counts, mouse IDs) and analysed data (produced in flow cytometry analysis) painstakingly complied into a curated dataset. However because it was is precious in it’s current form we will treat the excel file as sacrosanct - do not edit.
To make things simple for this course, I have deleted most of flow cytometry data leaving only the % proportion of each cell type.
Now lets load the data. This will build on what you learned in course 1. There are too many columns (cols) to parse each one individually. The read_excel() function will automatically parse the data and assign it to the correct data type (we hope). Much easier than specifying each col.
#make sure you don't have the file open in excel!
raw <- read_excel("./data/plueral_cavity_database.xlsx")
What size is the tibble? Lets use the dim() or dimension function. Btw, a tibble is the tidyverse version of a data frame.
dim(raw) #rows x cols
[1] 666 29
Each row represents a mouse and each col an individual datapoint about that mouse. Lets print the colnames with the automatically assigned datatype by the read_excel() function.
print(sapply(raw,class)) # this is an example of a vectorised function use in R
Global ID Experiment
"numeric" "character"
Mouse Experimental Group
"numeric" "character"
Infection Sex
"character" "character"
Strain Background
"character" "character"
Genotype Treatment
"character" "character"
OK for strain comparison Timepoint
"character" "numeric"
Remove from Analysis (justified) Day 0 Age
"character" "numeric"
Age Live worms
"numeric" "numeric"
Cell Count B cells
"numeric" "numeric"
Eosinophils Neutrophils
"numeric" "numeric"
CD4+ T cells CD4- T cells
"numeric" "numeric"
T cells DC
"numeric" "numeric"
Myeloid cells Monocytes
"numeric" "numeric"
SCM Converting SCM
"numeric" "numeric"
LCM
"numeric"
Its too long to print inline but you can look at it in a spreadsheet format using view(raw). Looking at the col names we can see that some of the cols are character vectors (categorical). These include cols that are likely to be metadata, such as ‘Experiment’ (which experiment the datapoint comes from). Other character variables could be considered to be results, ‘sex’ and ‘age’. Most of the variable are numeric. The numeric data includes cell counts, worm counts and flow cytometry data. The flow cytometry data was generated using the following gating, and expressed as % of total live cells.
Data cleaning/wrangling
Now lets clean up the data. It is much better to do this in R than in excel as we never manipulate the data on the disk only in memory.
Note: you will see that $ operator is used to select a variable e.g.raw$age however when there is a space in the variable name such as ‘Live worms’ you need to place the variable in backquotes (below ESC key, you have to double press the key). Rstudio makes this easy for you because when you open the $ operator on a dataframe/tibble it opens a dropdown menu you can navigate with the arrow keys - try this yourself in the console.
QC
Lets remove data that did not pass QC. Lets investigate what is inside these QC variables and remove the bad data.
table(raw$`Remove from Analysis (justified)`) # table() function counts unique elements and returns a table
F T
642 24
table(raw$`OK for strain comparison`)
F T Y
180 484 2
Note we can see an error in the dataset, this is common for datasets made by humans. We have F (false), T (True) but also Y (Yes). We could correct Y to T and include those two values but let’s just move on.
#notice for the first col I did != (NOT equal)
raw <- raw %>% filter(`Remove from Analysis (justified)` != "T") %>% filter(`OK for strain comparison` == "T")
our dataset is now smaller and saved in memory.
dim(raw)
[1] 474 29
We are left with 474 mice for further analysis, nice!
Objectives
We want to produce plots of different immune cells populations that adheres to some of Tufte’s principles of good data visualization even if it takes time to get there. These plots look nice, if I don’t say so myself. Let’s create something similar! .
If you look at bar chart on the top right for Myeloid cells what would happen if we tried to plot this using the most basic ggplot code? Don’t worry about the code we will go through that in detail.
raw %>% ggplot(aes(x = Infection, y=`Myeloid cells`)) + geom_point()
Lets try and change the plot to replicate the plot on the bottom right, Meyloid cells by timepoint
raw %>% ggplot(aes(x = Timepoint, y=`Myeloid cells`)) + geom_point()
Ok, so this doesn’t look very nice, does it? ggplot’s basic plotting functions actually look terrible, but don’t worry we will make this plot look much better.
What we need to do is take a step back and descripe what we want to do.
If we were to describe these plots we would first say that they are separated by genetic background - C57BL/6 and BALB/c. The bar charts are also separated by infection status - Naive and Infected. So we need to tell R to handle these factors.
We would also add that we want to colour the plots red and black, spread out the points, add boxplots, lines, and statistics and just make the plots look nicer with better backgrounds and text.
When programming always breakdown this to simple tasks
Bar charts
- Create category for strain (Background), and separate points by strain
- Colour points by strain
- Spread out those points
- Add boxplots
- Add statistics
- Increase text size
- Improve general aesthetics
Scatter plots
- Create category for Infection status
- Add line on scatter plot
- perform linear regression analysis
Create categories
Factorising
Factors in R are basically a way of assigning categories to the data, so instead of being stored as a new character string for each row, they are given a unique ID for each factor that is used repeatedly. This has lots of benefits, not least it is computationally cheaper requiring less memory.
# see what the options are.
table(raw$Infection)
Infected Naive
313 161
table(raw$Background)
BALB/c C57BL/6
201 273
if we look at the current class of the variable it is a character vector and has no factor ‘levels’
class(raw$Infection)
[1] "character"
levels(raw$Infection)
NULL
Now we will factorize and check the class of the updated variable
raw$Infection <- factor(raw$Infection,levels = c("Naive", "Infected")) # notice we define the order
raw$Background <- factor(raw$Background,levels = c("C57BL/6", "BALB/c"))
class(raw$Infection)
[1] "factor"
We can investigate the facrors using the levels() function
levels(raw$Infection) # we could also run this like this : raw$Infection %>% levels()
[1] "Naive" "Infected"
Plotting
Our goal here is to iteratively improve the aesthetics and readability of our plots following the list we made above.
We will use ggplot which is primary reason R has become so popular over the last few years amongst data scientists, but especially biologists.
For the basics of ggplot check out an excellent cheat sheet
Basic plotting
We will do this by adding layers to the ggplot using the + operator. Let’s start with the basic ggplot function. Here we pass the dataset to ggpot but not told ggplot to do anything, so all we get is a blank plot area.
raw %>% ggplot(aes())
Next we tell ggplot to plot infection status (as a categorical variable) on the x axis and % myeloid cells on the y axis. For this we use the aes() within the ggplot function. However while ggplot shows those axies it doesn’t show any data. That is because ggplot has no default way it displays data, we must tell ggplot to plot a bar chart or plot points, etc.
raw %>% ggplot(aes(x = Infection, y=`Myeloid cells`))
Adding points
Below we tell ggplot to plot points using the geom_point() function. Note that we use an + to add additional elements to the graph, these are called layers. Finally we see the data. It’s not very attractive however, with the points all bunched up and we can’t really infer anything about data.
raw %>% ggplot(aes(x = Infection, y=`Myeloid cells`)) +
geom_point()
boxplots
Boxplots are an excellent way to view data which have high n number. We add these using geom_boxplot()
raw %>% ggplot(aes(x = Infection, y=`Myeloid cells`)) +
geom_point() +
geom_boxplot()
Next we want to add colour to the plot. To do this globally (i.e. to all layers) we add it within the aes() function as aes(col=Background). This colours both our boxplots, points and legend by genetic background.
raw %>% ggplot(aes(x = Infection, y=`Myeloid cells`, col=Background)) +
geom_point() +
geom_boxplot()
fancy points
The col argument actually separated out boxplot but not the points. For this we will make use of a new R package called ggbeeswarm. You can do this without an additional package but I like this method as the way the points get distributed is very similar to the way Prism handles points.
raw %>% ggplot(aes(x=Infection, y= `Myeloid cells`, col=Background)) +
geom_boxplot() +
geom_quasirandom(method = "tukeyDense", dodge.width = 0.75)
Much nicer! Now let us change the colours to black and red.
Grasping how programming languages handle colour can be a bit complex. The use of colour in data visualisation is an entire research field in itself. Randomly picking colours you think might go well together is a recipe for disaster. Lots of pre-made colour palettes exist for this reason. That said, we are just going to pick red and black! These are pretty contrasting and unlikely to cause confusion.
Define a colour palette in R
There are a number of ways of doing this, this is the simplest and embedded within ggplot without the need for another package.
cols <- c("black","firebrick3", "purple", "green") # just a character vector, does nothing by itself, we only have 2 categories but we supply 4.
raw %>% ggplot(aes(x = Infection, y=`Myeloid cells`, col=Background)) +
geom_quasirandom(method = "tukeyDense",dodge.width = 0.75) +
geom_boxplot() +
scale_color_manual(values = cols) # function that looks up these colours
If we wanted to reuse the same palette in many plots we can extract the function and save it in R. Programming is all about being concise and anything that reduces copy and pasting will reduce error.
pal <- scale_color_manual(values = cols)
raw %>% ggplot(aes(x = Infection, y=`Myeloid cells`, col=Background)) +
geom_quasirandom(method = "tukeyDense",dodge.width = 0.75) +
geom_boxplot() +
pal # more consise
But what about the points being behind the boxplot - simple, reorder the layers!
raw %>% ggplot(aes(x = Infection, y=`Myeloid cells`, col=Background)) +
geom_boxplot() + # moved up a line
geom_quasirandom(method = "tukeyDense",dodge.width = 0.75) +
pal
Editing thematic elements
We can change the axis titles, and just about anything else about the plot.
raw %>% ggplot(aes(x = Infection, y=`Myeloid cells`, col=Background)) +
geom_boxplot() + # moved up a line
geom_quasirandom(method = "tukeyDense",dodge.width = 0.75) +
xlab("") +
ggtitle("% Macrophages") +
pal
We can add bespoke themes such as cowplot using theme() function and its derivatives.
raw %>% ggplot(aes(x = Infection, y=`Myeloid cells`, col=Background)) +
geom_boxplot() + # moved up a line
geom_quasirandom(method = "tukeyDense",dodge.width = 0.75) +
xlab("") +
pal +
theme_cowplot(font_size = 24)
Statistics
R is the preferred programming language for statistics because it comes inbuilt with lots of stats functions and was largely built by statisticians and ecologists who are basically statisticians.
We can run a t test for example using the t.test() function.
for_t <- raw %>% filter(Infection == 'Infected')
t.test(`Myeloid cells` ~ Background, data = for_t)
Welch Two Sample t-test
data: Myeloid cells by Background
t = 12.235, df = 273.23, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
15.29692 21.16390
sample estimates:
mean in group C57BL/6 mean in group BALB/c
41.11024 22.87983
Nice, but if we want to replace other graphing programs it would be nice to do this automatically on the plot. We can use the R package ggpubr to do this. Again we can store the function with specific arguments as an r object for future use.
comp <- stat_compare_means(method = "wilcox.test", label = "p.signif", label.x.npc = "middle",size=8, vjust = 1,show.legend = FALSE)
t <- theme_cowplot(font_size = 24)
raw %>% ggplot(aes(x = Infection, y=`Myeloid cells`, col=Background)) +
geom_boxplot() + # moved up a line
geom_quasirandom(method = "tukeyDense",dodge.width = 0.75) +
xlab("") +
pal + comp + t # much more concise
Saving plots
Now that we have perfected our graph we need to save it. We save the plot as an object and print that to a file. We can do this with the ggsave function.
p <-raw %>% ggplot(aes(x = Infection, y=`Myeloid cells`, col=Background)) +
geom_boxplot() + # moved up a line
geom_quasirandom(method = "tukeyDense",dodge.width = 0.75) +
xlab("") +
pal +
stat_compare_means(method = "wilcox.test", label = "p.signif", label.x.npc = "middle",size=8, vjust = 1,show.legend = FALSE) +
theme_cowplot(font_size = 24)
ggsave(p, filename = "results/Myeloid_cells.pdf", width = 8, height = 5)
Advanced plots.
We have created a plot which is as high in quality as anything produced by prism or excel. R can create far more diverse plots than those programs, and you gain much more control over the visual aspects of the plots. However, it is a lot more complicated than using those GUI programs. The value of R (or Python) lies in not needing to manipulate the underlying data, e.g. by copy and pasting to create groups. Moreover we can perform transformations (functions) on the data simply, or change what we are looking out. We could even write a program that went through all your data and produced plots automatically - no more cumbersome copy and pasting.
We can perform maths on our data. Here we have been looking at % myeloid cells but maybe we want to look at total number of cells. \[\frac{ Cell Perentage}{100} * CellCount = Total number\]
We can do this in two ways:
- Create a new col by multiplying
Myeloid cellsdivided by 100 by byCell Countusing themutate()function:
raw <- raw %>% mutate(`Myeloid cell number` = (`Myeloid cells`/100) * `Cell Count` )
raw %>% ggplot(aes(x = Infection, y=`Myeloid cell number`, col=Background)) +
geom_boxplot() + # moved up a line
geom_quasirandom(method = "tukeyDense",dodge.width = 0.75) +
xlab("") +
pal + comp + t
- Perform the same maths within ggplot
raw %>% ggplot(aes(x = Infection, y=(`Myeloid cells`/100)*`Cell Count`, col=Background)) +
geom_boxplot() + # moved up a line
geom_quasirandom(method = "tukeyDense",dodge.width = 0.75) +
xlab("") +
pal + comp + t
Lets try to change the variable we are graphing. Remember you can print the variables using colnames(raw) function. ‘Live worms’ sounds interesting, let’s graph that, you can graph whatever you want.
raw %>% ggplot(aes(x = Infection, y=`Live worms`, col=Background)) +
geom_boxplot() + # moved up a line
geom_quasirandom(method = "tukeyDense",dodge.width = 0.75) +
xlab("") +
pal + comp + t
We can easily change the graph type to a scatterplot and add a linear regression
raw %>% ggplot(aes(x = Eosinophils, y=`Live worms`,col=Background, add = "reg.line")) +
geom_point() +
geom_smooth(method = 'lm') +
stat_cor(label.y = -2) +
pal + t + facet_wrap(~Background)
We can also look at change over time
raw <- raw %>% mutate(`Myeloid cell number` = `Myeloid cells` * `Cell Count` )
raw %>% ggplot(aes(x = Timepoint, y=`Myeloid cell number`, col=Background)) +
geom_point(position = position_jitterdodge(jitter.width = 10)) +
geom_smooth() +
pal + comp + t
ggsave(filename = "results/Myeloid_cells_timepoint.pdf", width = 8, height = 5)
Super hard challenge.
Taking all you have learned. Create a new plot:
- filter the raw dataset to include only infected samples. Hint: in the
Infectedcol - filter to only include mice taken after day 35 of infection. Hint: using the
Timepointcol and you will need the logical operator> - Calculate total numbers of neutrophils in the dataset. Hint: use
mutate()function - Using a boxplot, plot number of neutrophils on the y axis and Background on the x axis
- Change the colours to ‘blue’ (C57BL/6) and ‘green’ (BALB/c)
- Add statistics to the plot
- Add a title ‘I love Neuts’
- Save the plot as a pdf file
- What does this plot tell you?
Hint: most of this can be copy and pasted from above.
Solution below: but don’t look!
Public datasets
Typically data from flow cytometery experiments don’t get uplifted to public datasets. It is much more common to upload proteomic data transcriptomic data. This partly reflects that journals have requirements that data from these experiments should be uploaded to databases such as NCBI. This has led to a transformation in modern biology - with so much data freely available, it is often unnesscary to even do an experiment! More realistically, experiments are still critical but these datsets are used as a form of hypothesis generation. Allowing scientists to test ideas in silico before spending resourses in the lab.
Here we will download a public dataset from NCBI and reanalyse part of that datset. We will use data generated by a microarray which is a well established, if ageing, method of measuring RNA transcripts in a tissue sample. It uses probes to detect mRNA molecules isolated from cells.
The data generated from microarray (and RNA-seq) is typically put through a differential gene expression analysis pipline which compare two groups of samples (e.g. control group, manipulated group) to produce data which contains a fold change and statistical tests for each gene in the array (often 10 of 1000s of genes).
Downlaod the data
So we will use a dataset from a paper I liked. This is a complicated paper but we want to take a take a small subset of the data and use it. This is for a macrophage in the peritoneal cavity called a large peritoneal macrophage (LPM). This is similar to the LCM in my dataset above.
We want to compare the transcriptome between LPM from WT mice given no stimulus and WT LPM given IL-4 complex a cytokine which activates type 2 immune responses and in macrophages is said to lead to ‘alternative activation’ (M2, M(IL-4) activation)
The paper has a link to the data in the methods section of the paper: GSE125727.
This brings us to a NCBI page for the datasets. For microarray datasets we can easily reanalyse these data using an R script built into the website called “Geo2R”. Analysis of RNA-sew data is a little more involved, but we will ignore that in this course.
I have highlighted the relevant samples. To start click geo2R.
On the next page, we define our groups. Do this by clicking ‘define groups’ typing 2 names, “naive” and “IL-4c”. To assign samples to those groups, cntrl or shift click the relevant sameps and then click on those groups.
How to assign groups
Then scroll down to the bottom of the page and click “Analyze”. This will tell the server to run the r script using the argument you provided (group information). You can actually view the R script if you like, or even copy it into R to run it on your own computer. Although you will need to install more packages, and the code might be a bit over your head at this stage. The analysis is a version of differential gene expression which looks at genes which are up or down in two groups.
This will take a couple of minutes to run, but eventually you will get a lot of outputs. What it will produce is a table of genes as rows, with various columns including 2 we are interested in, Fold change and P values.
Super cool outputs!
You can download these pictures if you like but what you really want is the table file. So select ‘download full table.’ This will download the a tsv file. A tsv file is like a csv file expect the data is separated by tabs not commas.
Move the file into the data folder for this project.
load in data
Here we will load in the data (assumming you were able to download it and put it in the right place). Unlike before we have to tell R that the tabs separate the data so the function needs another argument, sep="\t"
micro <- read.csv(file = "data/GSE125727.top.table.tsv", sep = "\t")
Lets investigate the file
dim(micro)
[1] 30177 8
So we have a lot or rows and 8 coloumns. Let’s look at the columns first.
colnames(micro)
[1] "ID" "adj.P.Val" "P.Value" "t" "B"
[6] "logFC" "Gene.symbol" "Gene.title"
Hmmm, this looks fairly understandable. The cols we will be interested in are "logFC’ or log base 2 of fold change, Gene.symbol, and adjusted P value. We could select only that data to simplify matters.
micro <- micro %>% select(c("adj.P.Val","logFC","Gene.symbol"))
Now our dataset micro contains only three cols. The dataset contains 30177 rows, likely corresponding to genes (in reality they correspond to probes not genes, so the same gene may appear multiple times). We would not write out 30177 genes but we can use the head function to look at the first few.
head(micro$Gene.symbol)
[1] "AA467197" "Rnase2a" "Chil4" "Cst7" "Pilra" "Tmigd1"
Cool! They look like genes to me! Chil4 is one of my favourite mouse genes too!
volcano plot
We will produce a basic volcano plot in R using ggplot2. We will need one more R package called ggrepel. Install this using install.packages("ggrepel").
library(ggrepel)
The most basic volcano plot can be made with the following code:
micro %>%
ggplot(aes(x =`logFC`,
y=-log(`adj.P.Val`) # do you notice the -log function?
)) +
geom_point() + theme_cowplot()
We can of course make this look better! Lets creat a new col that that can be used to highlight certain highly differential genes. The code from here on down might be a bit more complicated.
# mark up/down regulated genes
micro <- micro %>%
mutate(gene_type = case_when(logFC >= 2.5 & adj.P.Val <= 0.05 ~ "up",
logFC <= -2.5 & adj.P.Val <= 0.05 ~ "down",
TRUE ~ "ns"))
So what did that create? Let’s look
head(micro$gene_type)
[1] "down" "down" "down" "down" "up" "down"
We will also create some variable that we can use for advanced plotting
# Add colour, size and alpha (transparency) to volcano plot --------------------
cols <- c("up" = "#ffad73", "down" = "#26b3ff", "ns" = "grey")
sizes <- c("up" = 2, "down" = 2, "ns" = 1)
alphas <- c("up" = 1, "down" = 1, "ns" = 0.4)
micro %>%
ggplot(aes(x =`logFC`,
y=-log(`adj.P.Val`),
fill = gene_type,
size = gene_type,
alpha = gene_type)) +
ggtitle('naive LPM vs IL-4 complex LPM') +
xlab('Log2 Fold Change') +
geom_point(shape = 21, # Specify shape and colour as fixed local parameters
colour = "black") +
geom_hline(yintercept = -log10(0.001),
linetype = "dashed") +
geom_vline(xintercept = c(-2, 2),
linetype = "dashed") +
scale_fill_manual(values = cols) + # Modify point colour
scale_size_manual(values = sizes) + # Modify point size
scale_alpha_manual(values = alphas) + # Modify point transparency
scale_x_continuous(breaks = c(seq(-5, 5, 1)), limits = c(-5, 5)) +
theme_cowplot() +
theme(legend.position = 'none')
That looks nicer. I won’t describe each line above, but you can read each line and see if you can work out what it does. You can comment out a line in the code using a hash
# at the start of a line to stop the code being run, but without deleting it. Try commenting out various lines to see what that does.
Adding text to the volcano plot
The last thing we want to do is add text to the graph. Now we do not want to label all of the genes or our plot will look terrible and take hours to run. We just want to highlight the top few differential gene and highlight them on the graph.
First lets select the most differential genes. First arrange by LogFC ignoring minus sign using the abs() function and the desc() to place them in decending order.
micro <- micro %>%
arrange( desc(abs(logFC)))
head(micro, 20)
adj.P.Val logFC Gene.symbol gene_type
1 9.67e-07 -8.391100 Rnase2a down
2 2.28e-07 -7.661585 AA467197 down
3 1.07e-05 -6.095598 Mgl2 down
4 4.65e-06 -5.747173 Cst7 down
5 9.71e-06 -5.659000 Chil3 down
6 2.96e-06 -5.504030 Chil4 down
7 5.56e-06 -5.436653 Timd2 down
8 2.64e-05 5.359922 Fpr1 up
9 3.42e-05 5.320450 Cxcl13 up
10 4.95e-05 -5.071113 Ucp1 down
11 4.75e-05 -5.016402 Klk1b11 down
12 2.10e-05 -4.988765 I830127L07Rik down
13 2.10e-05 -4.988538 Mall down
14 3.81e-05 -4.849938 Gm5150 down
15 8.41e-06 -4.813952 Spp1 down
16 9.77e-06 -4.772435 Gatm down
17 3.26e-05 4.520842 Cxcl2 up
18 9.10e-05 4.516250 Ccl3 up
[ reached 'max' / getOption("max.print") -- omitted 2 rows ]
Now get the 32 top genes here.
genes_to_graph <- head(micro$Gene.symbol, 30)
What if we wanted to add some genes we like to this? Well we can just append the vector with some more gene symbols
genes_to_graph <- c(genes_to_graph, "Chil3", "Retnla", "Egr2", "Gapdh", "B2m" )
Next we want to create a new object which is a subset of micro that contains only the genes we are interested in. This is useful for telling ggplot to label only specific points. There are actually many ways of doing this and this is atually not a very intuitive method, but it works.
identify <- micro %>% filter(Gene.symbol %in% genes_to_graph )
Finally we graph this.
micro %>%
ggplot(aes(x =-`logFC`,
y=-log(`adj.P.Val`),
fill = gene_type,
size = gene_type,
alpha = gene_type)) +
ggtitle('naive LPM vs IL-4 complex LPM') +
xlab('Log2 Fold Change') +
geom_point(shape = 21, # Specify shape and colour as fixed local parameters
colour = "black") +
geom_hline(yintercept = -log10(0.01),
linetype = "dashed") +
geom_vline(xintercept = c(-2, 2),
linetype = "dashed") +
scale_fill_manual(values = cols) + # Modify point colour
scale_size_manual(values = sizes) + # Modify point size
scale_alpha_manual(values = alphas) + # Modify point transparency
scale_x_continuous(breaks = c(seq(-5, 5, 1)), limits = c(-5, 5)) +
geom_label_repel(data = identify,size=5, # Add labels last to appear as the top layer
aes(label = Gene.symbol),
force = 3,
nudge_y = 1,max.overlaps = 200) +
theme_cowplot(font_size = 18) +
theme(legend.position = 'none')
ggsave(filename = "volcano_.png",width = 8,height = 5,dpi = 300)
Try to reverse engineer some of that code. I admit it is likely a bit too difficult for you.
Wrap up
R is a very powerful tool for data visualisation, it is not just bar charts. Here is an example from my own research.
Try doing this in Prism/Exel
Look at the tidytuesday challenge to see what people can do.
You can pay money to learn to code, but all you need is Google. Try to do something, fail, Google it.
I am dyslexic so take all spelling and grammar in this document with a pinch of salt, the errors are essentially invisible to me! Thank you for the poorly paid PhD student’s for taking their time to help with this course.
Session info
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggrepel_0.9.1 ggpubr_0.4.0 ggbeeswarm_0.6.0 cowplot_1.1.1
[5] readxl_1.3.1 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.6
[9] purrr_0.3.4 readr_1.4.0 tidyr_1.1.3 tibble_3.1.2
[13] ggplot2_3.3.3 tidyverse_1.3.1 rmdformats_1.0.3 knitr_1.33
loaded via a namespace (and not attached):
[1] httr_1.4.2 sass_0.4.0 splines_4.0.2 jsonlite_1.7.2
[5] carData_3.0-4 modelr_0.1.8 bslib_0.2.5.1 assertthat_0.2.1
[9] highr_0.9 vipor_0.4.5 cellranger_1.1.0 yaml_2.2.1
[13] lattice_0.20-44 pillar_1.6.1 backports_1.2.1 glue_1.4.2
[17] digest_0.6.27 ggsignif_0.6.1 rvest_1.0.0 colorspace_2.0-1
[21] Matrix_1.3-3 htmltools_0.5.1.1 pkgconfig_2.0.3 broom_0.7.6
[25] haven_2.4.1 bookdown_0.22 scales_1.1.1 openxlsx_4.2.3
[29] rio_0.5.26 mgcv_1.8-35 generics_0.1.0 farver_2.1.0
[33] car_3.0-10 ellipsis_0.3.2 withr_2.4.2 cli_2.5.0
[37] magrittr_2.0.1 crayon_1.4.1 evaluate_0.14 fs_1.5.0
[41] fansi_0.4.2 nlme_3.1-152 rstatix_0.7.0 xml2_1.3.2
[45] foreign_0.8-81 beeswarm_0.3.1 tools_4.0.2 data.table_1.14.0
[49] hms_1.1.0 lifecycle_1.0.0 munsell_0.5.0 reprex_2.0.0
[53] zip_2.1.1 compiler_4.0.2 jquerylib_0.1.4 rlang_0.4.11
[57] grid_4.0.2 rstudioapi_0.13 labeling_0.4.2 rmarkdown_2.8
[61] gtable_0.3.0 abind_1.4-5 DBI_1.1.1 curl_4.3.1
[65] R6_2.5.0 lubridate_1.7.10 utf8_1.2.1 stringi_1.6.2
[69] Rcpp_1.0.6 vctrs_0.3.8 dbplyr_2.1.1 tidyselect_1.1.1
[73] xfun_0.23
Answer to Challenge
#set palette up in advance
cols2 <- c("blue", "green")
pal2 <- scale_colour_manual(values = cols2)
raw %>%
filter(Infection == 'Infected') %>% #1. filter for infected data only
filter(Timepoint > 35) %>% #This is measured in days so 35 is day 35 of infecton
mutate(Total_neutrophils = (Neutrophils /100) * `Cell Count`) %>% #compute new variable for neutrophil numbers
ggplot(aes(x = Background, y = Total_neutrophils, col = Background)) +
geom_boxplot() + # moved up a line
geom_quasirandom(method = "tukeyDense",dodge.width = 0.75) +
theme_cowplot(font_size = 22) + # optional
pal2 +
stat_compare_means(method = "wilcox.test", label = "p.signif", label.x.npc = "middle",size=8, vjust = 1,show.legend = FALSE)
ggsave(filename = "./results/challenge_plot.pdf")